In [12]:
# TODO (insights from the professor)
# Don't save to JSON; keep the data locally in a variable
# Only match by labels (no other matching needed: no properties, superclasses, etc.)
# Save the labels in a structure where the key is the class URI and the value is the label; if memory is not a problem, save them in two structures, the second with the label as key and the class URI as value
# A runtime of 5-10 min for the human and mouse ontologies is reasonable

# Output: in RDF format (see reference_anatomy); only include the final matches above the user-defined threshold. Use owl:equivalentClass as the relation
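
The output step described in the notes above could look roughly like the following minimal sketch. It assumes matches are available as `(class1_uri, class2_uri, score)` tuples and writes N-Triples by hand to stay dependency-free. The function name `matches_to_ntriples` and the threshold value are illustrative assumptions, and the reference alignment file may well use a different RDF serialization.

```python
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"

def matches_to_ntriples(matches, threshold):
    """Serialize matches at or above `threshold` as owl:equivalentClass triples (N-Triples)."""
    lines = []
    for class1_uri, class2_uri, score in matches:
        if score >= threshold:
            lines.append(f"<{class1_uri}> <{OWL_EQUIVALENT_CLASS}> <{class2_uri}> .")
    return "\n".join(lines)

# Two pairs taken from the Jaccard results printed later in this notebook
example_matches = [
    ("http://mouse.owl#MA_0000064", "http://human.owl#NCI_C12372", 1.0),
    ("http://mouse.owl#MA_0001424", "http://human.owl#NCI_C33723", 0.25),
]
print(matches_to_ntriples(example_matches, threshold=0.8))
```

Only the first pair survives a threshold of 0.8; the low-scoring second pair is dropped.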

In [13]:
# Required libraries
!pip install sentence_transformers
!pip install pandas
!pip install rdflib
!pip install python-Levenshtein
!pip install scikit-learn


[notice] A new release of pip is available: 23.0.1 -> 24.0
[notice] To update, run: pip install --upgrade pip

# Project Description
The goal of the project is to develop a simple yet effective ontology alignment framework in Python that focuses on lexical similarity matching. The framework will utilize both string matching techniques and the semantic capabilities of large language models to identify potential alignments between entities (such as classes) in two different ontologies.

### Objectives
1. **Develop an ontology alignment framework** that can process and compare ontologies based on textual content.
2. **Implement lexical similarity matching** using both basic string matching techniques and advanced semantic analysis with embeddings from LLMs.
3. **Output alignments with confidence scores**, enabling users to understand and evaluate the quality and reliability of the suggested alignments.

### Steps to Perform

#### Step 1: Ontology Parsing
- **Goal**: Load and parse the ontologies to be aligned.
- **Tasks**:
  - Utilize libraries like `rdflib` or `owlready2` to read ontology files.
  - Extract relevant textual information (e.g., class names, labels, descriptions).

#### Step 2: Lexical Similarity Matching
This step is divided into two sub-steps: string matching and embeddings matching.

##### a. String Matching
- **Goal**: Implement direct and fuzzy string comparison techniques to find matches based on textual similarity.
- **Tasks**:
  - Perform normalization (e.g., lowercasing, removing special characters).
  - Use string comparison methods (exact match, substring search, edit distance).

##### b. Embeddings Matching Using LLMs
- **Goal**: Use the semantic context provided by LLMs to match terms based on their meanings.
- **Tasks**:
  - Generate embeddings for the textual content of each ontology using models from the Hugging Face Transformers library.
  - Calculate similarity scores between embeddings (e.g., using cosine similarity).
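
To make the similarity step concrete, here is a small dependency-free sketch of cosine similarity. The three-dimensional vectors are made up for illustration and are not real model output; actual sentence embeddings have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)

# Made-up toy "embeddings" (assumption, not produced by any model)
toy_embeddings = {
    "artery": [0.9, 0.1, 0.0],
    "arterial vessel": [0.8, 0.2, 0.1],
    "tongue": [0.0, 0.1, 0.9],
}

print(cosine(toy_embeddings["artery"], toy_embeddings["arterial vessel"]))
print(cosine(toy_embeddings["artery"], toy_embeddings["tongue"]))
```

Vectors pointing in similar directions score close to 1, while unrelated ones score close to 0.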

#### Step 3: Combining and Filtering
- **Goal**: Aggregate results from both matching techniques and refine the output.
- **Tasks**:
  - Combine scores from string and embeddings matching.
  - Apply thresholds to filter out matches with low confidence.
  - Optionally, use simple structural checks to add confidence to matches (e.g., matched entities have similar parent classes).
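
One possible way to combine and filter the two scores is a weighted average followed by a threshold. The sketch below assumes both scores are similarities in [0, 1]; the function names, the equal default weighting, and the identifiers `m1`, `h1`, etc. are hypothetical.

```python
def combine_scores(string_score, embedding_score, string_weight=0.5):
    """Weighted average of a string-based and an embedding-based similarity."""
    return string_weight * string_score + (1 - string_weight) * embedding_score

def filter_matches(candidates, threshold, string_weight=0.5):
    """Keep candidate pairs whose combined score reaches the threshold.

    candidates: iterable of (class1, class2, string_score, embedding_score).
    """
    kept = []
    for class1, class2, string_score, embedding_score in candidates:
        combined = combine_scores(string_score, embedding_score, string_weight)
        if combined >= threshold:
            kept.append((class1, class2, combined))
    return kept

candidates = [
    ("m1", "h1", 1.0, 0.9),  # strong on both measures
    ("m2", "h2", 0.2, 0.4),  # weak on both measures
]
print(filter_matches(candidates, threshold=0.8))
```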

#### Step 4: Output and Evaluation
- **Goal**: Output the alignment results and provide means for evaluation.
- **Tasks**:
  - Format the output in a structured way (e.g., JSON, CSV) that lists entity pairs and their matching scores.
  - If possible, evaluate the effectiveness using known benchmarks or test cases to calculate precision, recall, and F1-score.
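
The evaluation metrics named above can be computed from two sets of matched pairs. This is a minimal sketch that assumes an alignment is represented as a set of `(class1, class2)` tuples; the function name and the toy identifiers are illustrative.

```python
def evaluate_alignment(predicted, reference):
    """Return (precision, recall, f1) for predicted pairs against reference pairs."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

predicted = {("a", "x"), ("b", "y"), ("c", "z")}
reference = {("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")}
print(evaluate_alignment(predicted, reference))
```

Here two of three predictions are correct (precision 2/3) and two of four reference pairs are found (recall 1/2), which gives an F1-score of 4/7.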

### Summary
The project is centered on creating a practical tool for ontology matching, focusing on textual content using both conventional and advanced NLP techniques. By combining string-based and semantic-based approaches, the framework aims to provide robust alignments that are supported by both literal and contextual text similarities. This dual approach enhances the capability of the alignment process, making it more flexible and potentially more accurate than using only one method.

In [2]:
# imports
import json
import rdflib
import pandas as pd
from collections import OrderedDict, defaultdict

**rdflib vs owlready2:**

Interchangeability: Given that OWL is an application of RDF, tools that can parse RDF/XML can generally handle .owl files, and vice versa, provided that the ontology-specific constructs are understood by the tool. This is why libraries like rdflib, which are capable of parsing RDF, are suitable for handling OWL files serialized in RDF/XML format.

Flexibility: Choosing to work with rdflib for general RDF handling and owlready2 for specific ontology manipulations where needed is a flexible approach. It allows you to leverage the strengths of both libraries—rdflib for its robust RDF manipulation and SPARQL querying capabilities, and owlready2 for its ontology-specific features like reasoning and direct manipulation of classes and properties.

# Load/Parse ontologies

In [3]:
# input paths
onto1_path_in = "test_ontologies/mouse.owl"
onto2_path_in = "test_ontologies/human.owl"


In [4]:
def load_ontology(file_path):
    """
    Loads an ontology from a given file path, which can be in RDF (.rdf) or OWL (.owl) format.

    Args:
    file_path (str): The file path to the ontology file.

    Returns:
    rdflib.Graph: A graph containing the ontology data.
    """
    # Create a new RDF graph
    graph = rdflib.Graph()

    # Bind some common namespaces to the graph
    namespaces = {
        "rdf": rdflib.namespace.RDF,
        "rdfs": rdflib.namespace.RDFS,
        "owl": rdflib.namespace.OWL,
        "xsd": rdflib.namespace.XSD
    }
    for prefix, namespace in namespaces.items():
        graph.namespace_manager.bind(prefix, namespace)

    # Attempt to parse the file
    try:
        graph.parse(file_path, format=rdflib.util.guess_format(file_path))
        print(f"Successfully loaded ontology from {file_path}")
    except Exception as e:
        print(f"Failed to load ontology from {file_path}: {e}")
        return None

    return graph

In [5]:
# load ontologies
onto1_graph = load_ontology(onto1_path_in)
onto2_graph = load_ontology(onto2_path_in)
print(onto1_graph, onto2_graph)

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
[a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'IOMemory']]. [a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'IOMemory']].


In [6]:
def preprocess_label(label):
    if label is None:  # missing label: avoid turning None into the string "none"
        return None
    return str(label).replace("_", " ").strip(" ,.").lower()

### New approach without json and instead dicts

In [7]:
def extract_ontology_details_to_dict(graph):
    # Query for classes
    class_query = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?class ?label ?label_dt ?label_lang
    WHERE {
        ?class rdf:type owl:Class.
        OPTIONAL { ?class rdfs:label ?label. BIND(datatype(?label) AS ?label_dt) BIND(lang(?label) AS ?label_lang) }
    }
    """
    classes = graph.query(class_query)
    ontology_labels_dict = OrderedDict()
    labels_list = []

    # Process class results
    for row in classes:
        class_uri, label, label_dt, label_lang = row
        class_key = str(class_uri)
        label_str = preprocess_label(label)

        if label_str is None or label_str == "none":
            continue

        if label_str not in ontology_labels_dict:
            ontology_labels_dict[label_str] = class_key
            labels_list.append(label_str)

    return ontology_labels_dict, labels_list

In [8]:
onto1_dict, onto1_list = extract_ontology_details_to_dict(onto1_graph)
onto2_dict, onto2_list = extract_ontology_details_to_dict(onto2_graph)

In [9]:
onto1_dict

OrderedDict([('pial vein', 'http://mouse.owl#MA_0002196'),
             ('foot nerve', 'http://mouse.owl#MA_0000653'),
             ('extensor carpi ulnaris', 'http://mouse.owl#MA_0002292'),
             ('lower urinary tract', 'http://mouse.owl#MA_0002636'),
             ('tongue', 'http://mouse.owl#MA_0000347'),
             ('anterior tegmental nucleus', 'http://mouse.owl#MA_0001055'),
             ('hair shaft', 'http://mouse.owl#MA_0000159'),
             ('abdomen/pelvis/perineum blood vessel',
              'http://mouse.owl#MA_0000524'),
             ('perineum', 'http://mouse.owl#MA_0002466'),
             ('tail intervertebral disc', 'http://mouse.owl#MA_0000698'),
             ('thoracic cavity connective tissue',
              'http://mouse.owl#MA_0000555'),
             ('hindlimb connective tissue', 'http://mouse.owl#MA_0000661'),
             ('superior oblique extraocular muscle',
              'http://mouse.owl#MA_0001277'),
             ("peyer's patch epithelium", 'h

In [10]:
onto1_list

['pial vein',
 'foot nerve',
 'extensor carpi ulnaris',
 'lower urinary tract',
 'tongue',
 'anterior tegmental nucleus',
 'hair shaft',
 'abdomen/pelvis/perineum blood vessel',
 'perineum',
 'tail intervertebral disc',
 'thoracic cavity connective tissue',
 'hindlimb connective tissue',
 'superior oblique extraocular muscle',
 "peyer's patch epithelium",
 'rib 2',
 'submental vein',
 'urethra gland',
 'loop of henle descending limb',
 'hand phalanx',
 'artery smooth muscle',
 'facial vii nerve chorda tympani branch',
 'esophagus wall',
 'tear',
 'ear',
 'sternebra',
 'carpal bone',
 'foot skin',
 'skin sebaceous gland',
 'muzzle/snout',
 'pelvis nerve',
 'pons',
 'cranial nerve',
 'subparaventricular zone',
 'anal region smooth muscle',
 'renal column',
 'inner ear epithelium',
 'heart atrium',
 'lacrimal apparatus',
 'costal arch',
 'maxillary vein',
 'lateral cerebellar nucleus',
 'hepatic artery',
 'red nucleus',
 'esophagus',
 'spinal ganglion',
 'medial preoptic region',
 'audito

In [11]:
# def find_duplicates(ordered_dict): # => filtering for multiple labels works => see outdated notebook

## 3. Matching

### 3.1. String Matching

For string matching, we implement four different methods, which the user can choose via a parameter when calling the matching function.

The metrics we will use are:
- Levenshtein distance
- Jaccard Similarity
- Cosine Similarity
- TF-IDF

In [12]:
import Levenshtein
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score

def levenshtein_distance(str1, str2):
    return Levenshtein.distance(str1, str2)

def calc_cosine_similarity(str1, str2):
    vectorizer = CountVectorizer()
    count_matrix = vectorizer.fit_transform([str1, str2])
    return cosine_similarity(count_matrix)[0][1]

def jaccard_similarity(str1, str2):
    # Tokenize the strings into sets of words
    set1 = set(str1.split())
    set2 = set(str2.split())

    # Find the intersection and union of the two sets
    intersection = set1.intersection(set2)
    union = set1.union(set2)

    # Calculate the Jaccard score
    if not union:  # Handle the edge case where both strings might be empty
        return 0.0
    return len(intersection) / len(union)

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_vectorize_labels(labels):
    """
    Converts a list of labels into token-count vectors using CountVectorizer.

    Args:
    labels (list): List of all labels from both ontologies.

    Returns:
    CountVectorizer, scipy.sparse.csr_matrix: The vectorizer and the count matrix.
    """
    vectorizer = CountVectorizer()
    count_matrix = vectorizer.fit_transform(labels)
    return vectorizer, count_matrix

def cosine_compare_labels(count_matrix, index1, index2):
    """
    Computes the cosine similarity between two labels based on their count vector indices.

    Args:
    count_matrix (scipy.sparse.csr.csr_matrix): The matrix containing the count vectors.
    index1, index2 (int): Indices of the labels to compare.

    Returns:
    float: Cosine similarity score.
    """
    return cosine_similarity(count_matrix[index1:index1+1], count_matrix[index2:index2+1])[0][0]


def execute_cosine_string_matching(label_list1, label_list2):
    # Combine labels and vectorize them
    all_labels = label_list1 + label_list2
    vectorizer, count_matrix = cosine_vectorize_labels(all_labels)

    # Example comparison between the first label of ontology 1 and the first label of ontology 2
    similarity_score = cosine_compare_labels(count_matrix, 0, len(label_list1))
    print(f"Similarity score between '{label_list1[0]}' and '{label_list2[0]}': {similarity_score}")

In [14]:
execute_cosine_string_matching(onto1_list, onto2_list)

Similarity score between 'pial vein' and 'vestibular hair cell': 0.0
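
The call above only compares the first label of each ontology. A sketch of how every label could be mapped to its best-scoring counterpart is shown below. It is dependency-free and splits on whitespace, which differs from `CountVectorizer`, whose default token pattern ignores single-character tokens. The function names are hypothetical, and a pure-Python loop over all pairs may be slow for ontologies of this size.

```python
import math
from collections import Counter

def token_cosine(label_a, label_b):
    """Cosine similarity between whitespace-token count vectors."""
    counts_a, counts_b = Counter(label_a.split()), Counter(label_b.split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_matches(labels1, labels2):
    """For each label in labels1, return (best label in labels2, score)."""
    results = {}
    for label1 in labels1:
        best_label = max(labels2, key=lambda label2: token_cosine(label1, label2))
        results[label1] = (best_label, token_cosine(label1, best_label))
    return results

toy_labels1 = ["pial vein", "tongue"]
toy_labels2 = ["vestibular hair cell", "tongue", "vein"]
for label, (match, score) in best_matches(toy_labels1, toy_labels2).items():
    print(f"{label!r} -> {match!r} ({score:.3f})")
```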


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_vectorize_labels(labels):
    """
    Converts a list of labels into TF-IDF vectors using TfidfVectorizer.

    Args:
    labels (list): List of all labels from both ontologies.

    Returns:
    TfidfVectorizer, scipy.sparse.csr.csr_matrix: The vectorizer and the TF-IDF matrix.
    """
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(labels)
    return vectorizer, tfidf_matrix

def tfidf_compare_labels(tfidf_matrix, index1, index2):
    """
    Computes the cosine similarity between two labels based on their TF-IDF vector indices.

    Args:
    tfidf_matrix (scipy.sparse.csr.csr_matrix): The matrix containing the TF-IDF vectors.
    index1, index2 (int): Indices of the labels to compare.

    Returns:
    float: Cosine similarity score.
    """
    return cosine_similarity(tfidf_matrix[index1:index1+1], tfidf_matrix[index2:index2+1])[0][0]


def execute_tfidf_string_matching(label_list1, label_list2):
    # Combine labels and vectorize them
    all_labels = label_list1 + label_list2
    vectorizer, tfidf_matrix = tfidf_vectorize_labels(all_labels)

    # Example comparison between the first label of ontology 1 and the first label of ontology 2
    similarity_score = tfidf_compare_labels(tfidf_matrix, 0, len(label_list1))
    print(f"Similarity score between '{label_list1[0]}' and '{label_list2[0]}': {similarity_score}")

In [16]:
execute_tfidf_string_matching(onto1_list, onto2_list)

Similarity score between 'pial vein' and 'vestibular hair cell': 0.0


In [17]:
def execute_string_matching(metric, data1, data2):
    """
    Executes the selected matching metric on the provided data.

    Args:
    metric (str): Name of the metric to use:
                  'Levenshtein' for Levenshtein Distance,
                  'Jaccard' for Jaccard Similarity,
                  'LinkTransformer' for Link Transformer (not implemented yet).
    data1, data2 (str): The data strings to compare.

    Returns:
    result: The result of the chosen metric computation.
    """
    if metric == 'Levenshtein':
        return levenshtein_distance(data1, data2)
    elif metric == 'Jaccard':
        return jaccard_similarity(data1, data2)
    elif metric == 'LinkTransformer':
        raise NotImplementedError("LinkTransformer matching is not implemented yet") # TODO implement or remove
    else:
        raise ValueError("Invalid metric selection")

In [18]:
len(onto1_dict)

2738

In [19]:
# load ontologies
onto1_graph = load_ontology(onto1_path_in)
onto2_graph = load_ontology(onto2_path_in)

onto1_dict, onto1_list = extract_ontology_details_to_dict(onto1_graph)
onto2_dict, onto2_list = extract_ontology_details_to_dict(onto2_graph)

print(len(onto1_list))
print(len(onto2_list))

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
2738
3298


In [74]:
def match_ontologies(onto1_dict, onto1_list, onto2_dict, onto2_list, metric, bidirectional=False):
    labels_already_tested_labels = {} # maps each label of ontology 1 to the labels of ontology 2 that were already tested for it => necessary to avoid an infinite loop

    for label in onto1_list:
        labels_already_tested_labels[label] = []

    onto2_used_classes = {}

    if metric == "Cosine":
        execute_cosine_string_matching(onto1_list, onto2_list)
        # TODO extend so it return scores etc. and also can be mapped to nodes for next step
    elif metric == "TF-IDF":
        execute_tfidf_string_matching(onto1_list, onto2_list)
        # TODO extend so it return scores etc. and also can be mapped to nodes for next step
    else:
        class_results = {}

        while onto1_list: # loop over labels of ontology 1 until empty
            #print(len(onto1_list))
            label1 = onto1_list.pop() # remove the last element in the list => removing the last (instead of first) makes things easier and less error prone
            # labels that got added again cause a better match was found (see later step) will be appended to the end and therefore handled immediately

            # Match from Ontology 1 to Ontology 2
            label_result = [label1, "", "", 0]
            best_score = 0
            already_tested_labels = labels_already_tested_labels[label1]
            for label2 in onto2_list:
                if label2 not in already_tested_labels: # check that label wasn't already checked in previous run
                    matching_score = execute_string_matching(metric, label1, label2) # calculate string matching score
                    if metric == 'Levenshtein': # convert the edit distance to a similarity in [0, 1] so that higher is always better
                        matching_score = 1 - matching_score / max(len(label1), len(label2), 1)
                    # If a perfect match is found, stop iterating over labels for this entry
                    if matching_score == 1: # handle perfect match
                        best_score = matching_score
                        label_result = [label1, "", label2, best_score]
                        break # stop searching for matches because a perfect match was found
                    # Keep the best match found so far for this label
                    if matching_score > best_score: # handle higher score than before
                        label_result[2] = label2
                        best_score = matching_score

            label_result[3] = best_score # save best score in label_result
            label_with_best_score = label_result[2] # get label that achieved the best score

            class_uri = onto1_dict[label1] # get the class_uri of the currently checked label in ontology 1
            if label_result[3] == 0 and label_with_best_score == '': # handle if no match was found
                class_results[class_uri] = label_result
            else:
                class2_uri = onto2_dict[label_with_best_score] # get class_uri of the label with best match
                label_result[1] = class2_uri # save class_uri instead of label => TODO maybe change to not manipulate label_result as it is confusing for later steps

                if class2_uri not in onto2_used_classes: # check if class found of ontology 2 is NOT already used by other class in ontology 1
                    if class_uri not in class_results: # handle no entry exists yet for that class
                        class_results[class_uri] = label_result
                        onto2_used_classes[class2_uri] = class_uri
                        labels_already_tested_labels[label1].append(label_with_best_score)
                    elif label_result[3] > class_results[class_uri][3]: # handle entry exist but now higher score was found with another label of the class (handles multiple labels)
                            class_results[class_uri] = label_result
                            onto2_used_classes[class2_uri] = class_uri
                            labels_already_tested_labels[label1].append(label_with_best_score)
                else: # class of ontology 2 already in use
                    result_current_class_in_use = onto2_used_classes[class2_uri] # get class uri of class that uses that class of ontology 2
                    if label_result[3] > class_results[result_current_class_in_use][3]: # if score of the new found match is higher than the current assigned one
                        class_results[class_uri] = label_result # set the class of ontology 2 to that current class
                        onto2_used_classes[class2_uri] = class_uri # overwrite the use of that class to new class of ontology 1
                        old_used_label = class_results[result_current_class_in_use][0]
                        labels_already_tested_labels[old_used_label].append(label_with_best_score)
                        onto1_list.append(old_used_label) # re-add the previously matched label to the work list, as it no longer has a match
                        class_results[result_current_class_in_use] = ["", "", "", 0] # reset the result of the earlier class (it could also be removed, but this way we can later detect that no match was found)
                    else: # handle not a higher score
                        labels_already_tested_labels[label1].append(label_with_best_score) # add the label to the already_tested_labels
                        onto1_list.append(label1) # re-append the current label, as it needs to be handled again with the updated already_tested_labels

        # TODO implement bidirectional matching

        # OLD_TO-DO important: currently it takes the best match found for the current class of the ontology 1.
        # But it doesnt take into account if a later class has a higher score with that class and therefore would be better suited
        # Solution: make dict and always take one element that gets removed. If later another element matches with the class matched with
        # the previous element the earlier removed element get added again and the value of that elements gets assigned to the higher value element

        return class_results
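
The loop above enforces a one-to-one assignment by displacing weaker matches and re-queuing them. A simpler way to express a similar intent is to score all pairs first and then assign greedily from the best score downwards. The sketch below is a different technique, not a drop-in replacement, and it may produce different matches; the function name and toy identifiers are hypothetical.

```python
def greedy_one_to_one(scored_pairs):
    """Greedy one-to-one assignment.

    scored_pairs: iterable of (class1_uri, class2_uri, score), higher score is better.
    Returns {class1_uri: (class2_uri, score)}.
    """
    used1, used2, matches = set(), set(), {}
    # Visit pairs from highest to lowest score and accept a pair only if both sides are still free
    for class1, class2, score in sorted(scored_pairs, key=lambda pair: pair[2], reverse=True):
        if class1 not in used1 and class2 not in used2:
            matches[class1] = (class2, score)
            used1.add(class1)
            used2.add(class2)
    return matches

toy_pairs = [("a", "x", 0.9), ("b", "x", 0.95), ("a", "y", 0.5), ("b", "y", 0.4)]
print(greedy_one_to_one(toy_pairs))
```

In the toy data, `b` wins `x` because of its higher score, so `a` falls back to `y`.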


In [75]:
# input paths
onto1_path_in = "test_ontologies/mouse.owl"
onto2_path_in = "test_ontologies/human.owl"

onto1_graph = load_ontology(onto1_path_in)
onto2_graph = load_ontology(onto2_path_in)

onto1_dict, onto1_list = extract_ontology_details_to_dict(onto1_graph)
onto2_dict, onto2_list = extract_ontology_details_to_dict(onto2_graph)

string_matching_results = match_ontologies(onto1_dict, onto1_list, onto2_dict, onto2_list, 'Jaccard')

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl


In [53]:
print(len(onto1_dict))
print(len(string_matching_results))
# TODO one element is always missing: onto1_dict has 2738 entries but only 2737 matches result (see the output of this cell). Same with the LLM matching

2738
2737


In [23]:
# Comparing with reference_anatomy where matches all have measure of 1
print(string_matching_results["http://mouse.owl#MA_0000110"])
print(string_matching_results["http://mouse.owl#MA_0001468"])
print(string_matching_results["http://mouse.owl#MA_0001886"])
print(string_matching_results["http://mouse.owl#MA_0000702"]) # has 0.75
print(string_matching_results["http://mouse.owl#MA_0000751"]) # has 0.8
print(string_matching_results["http://mouse.owl#MA_0000064"])
print(string_matching_results["http://mouse.owl#MA_0000062"])
print(string_matching_results["http://mouse.owl#MA_0002497"]) # has 0.66
print(string_matching_results["http://mouse.owl#MA_0001221"])
print(string_matching_results["http://mouse.owl#MA_0001424"]) # has 0.25 => check why so low and for which class then the correct label was used: http://human.owl#NCI_C32242

['intervertebral disc', 'http://human.owl#NCI_C49571', 'intervertebral disc', 1.0]
['occipital bone', 'http://human.owl#NCI_C12757', 'occipital bone', 1.0]
['external ear cartilage', 'http://human.owl#NCI_C49225', 'external ear cartilage', 1.0]
['aorta smooth muscle', 'http://human.owl#NCI_C49191', 'aorta smooth muscle tissue', 0.75]
['lymphatic vessel smooth muscle', 'http://human.owl#NCI_C49260', 'lymphatic vessel smooth muscle tissue', 0.8]
['artery', 'http://human.owl#NCI_C12372', 'artery', 1.0]
['aorta', 'http://human.owl#NCI_C12669', 'aorta', 1.0]
['liver perisinusoidal space', 'http://human.owl#NCI_C33309', 'perisinusoidal space', 0.6666666666666666]
['tensor tympani', 'http://human.owl#NCI_C33748', 'tensor tympani', 1.0]
['cervical vertebra 4', 'http://human.owl#NCI_C33723', 't12 vertebra', 0.25]
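
The Jaccard scores printed above follow directly from the token sets of the two labels. As a worked check, this small sketch re-implements the same word-level Jaccard measure (the helper name `jaccard` is only for illustration) and reproduces two of the values:

```python
def jaccard(label_a, label_b):
    """Word-level Jaccard similarity: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = set(label_a.split()), set(label_b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

print(jaccard("aorta smooth muscle", "aorta smooth muscle tissue"))  # 3 shared / 4 distinct = 0.75
print(jaccard("cervical vertebra 4", "t12 vertebra"))                # 1 shared / 4 distinct = 0.25
```

The low score of 0.25 arises because the two labels share only the token `vertebra`.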


In [24]:
import itertools
import time
from sentence_transformers import SentenceTransformer, util

def calculate_label_similarity_llm(model_name, onto1_dict, onto2_dict):
  """
  Calculate the cosine similarity between all pairs of label embeddings from two ontologies.
  Each key in the returned dictionary is a class URI from ontology 1, and each value is a dictionary
  that maps class URIs from ontology 2 to their similarity score, sorted by descending score.

  Parameters:
  model_name (str): Name of the Sentence Transformer model to be used.
  onto1_dict (OrderedDict): Dictionary where keys are labels and values are class URIs for the first ontology.
  onto2_dict (OrderedDict): Dictionary where keys are labels and values are class URIs for the second ontology.

  Returns:
  dict: A dictionary with class URIs from the first ontology as keys and dictionaries {class URI from the second ontology: score}, sorted by descending score, as values.
  """
  model = SentenceTransformer(model_name, device='cpu')
  onto1_labels, onto1_classes = zip(*onto1_dict.items())
  onto2_labels, onto2_classes = zip(*onto2_dict.items())

  onto1_label_embeddings = model.encode(list(onto1_labels), convert_to_tensor=True)
  onto2_label_embeddings = model.encode(list(onto2_labels), convert_to_tensor=True)

  similarity_scores = util.pytorch_cos_sim(onto1_label_embeddings, onto2_label_embeddings)

  # Initialize the dictionary to hold results
  results_dict = {}

  # Fill the dictionary with similarity scores
  for i, onto1_class in enumerate(onto1_classes):
    results_dict[onto1_class] = {}
    for j, onto2_class in enumerate(onto2_classes):
      results_dict[onto1_class][onto2_class] = similarity_scores[i][j].item()

  # Sort the dictionary entries by similarity score within each onto1_class
  sorted_results_dict = {}
  for onto1_class in results_dict:
    sorted_onto2_classes = sorted(results_dict[onto1_class].items(), key=lambda x: x[1], reverse=True)
    sorted_results_dict[onto1_class] = dict(sorted_onto2_classes)

  return sorted_results_dict

# Record start time
start_time = time.time()

dict_similarity_scores_llm = calculate_label_similarity_llm("sentence-transformers/all-MiniLM-L12-v2", onto1_dict, onto2_dict)

# Record end time
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

print(f"Time taken: {elapsed_time} seconds")



Time taken: 33.46857190132141 seconds
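
The function above stores a score for every pair of classes, which amounts to roughly nine million entries for these two ontologies. If memory becomes a concern, one option is to keep only the best few candidates per class. The sketch below assumes the same nested-dictionary layout; the function name, the value of `k`, and the toy identifiers are hypothetical.

```python
import heapq

def top_k_candidates(score_rows, k):
    """Keep the k highest-scoring candidates per class.

    score_rows: {class1_uri: {class2_uri: score}}
    Returns the same layout, with each inner dict sorted by descending score.
    """
    return {
        class1: dict(heapq.nlargest(k, candidates.items(), key=lambda item: item[1]))
        for class1, candidates in score_rows.items()
    }

toy_scores = {"m1": {"h1": 0.2, "h2": 0.9, "h3": 0.5}}
print(top_k_candidates(toy_scores, k=2))
```

Note that truncating the candidate list changes the matching behaviour: a class that loses all of its `k` candidates to better-scoring competitors would end up unmatched.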


In [25]:
# Print the first entries from the dictionary
for onto1_class in list(dict_similarity_scores_llm.keys())[:20]:
    print(f"{onto1_class}: {dict_similarity_scores_llm[onto1_class]}")

http://mouse.owl#MA_0001458: {'http://human.owl#NCI_C12933': 0.7486364841461182, 'http://human.owl#NCI_C12744': 0.725203275680542, 'http://human.owl#NCI_C12852': 0.7156738042831421, 'http://human.owl#NCI_C12693': 0.7074766755104065, 'http://human.owl#NCI_C12998': 0.6962596774101257, 'http://human.owl#NCI_C32334': 0.6928917169570923, 'http://human.owl#NCI_C49577': 0.6913926601409912, 'http://human.owl#NCI_C33868': 0.6775718331336975, 'http://human.owl#NCI_C12853': 0.6638357043266296, 'http://human.owl#NCI_C33869': 0.661840558052063, 'http://human.owl#NCI_C12819': 0.659957766532898, 'http://human.owl#NCI_C32902': 0.6418956518173218, 'http://human.owl#NCI_C32174': 0.6415303945541382, 'http://human.owl#NCI_C33730': 0.6409378051757812, 'http://human.owl#NCI_C32901': 0.6284998059272766, 'http://human.owl#NCI_C53145': 0.6284353733062744, 'http://human.owl#NCI_C33501': 0.624455988407135, 'http://human.owl#NCI_C33731': 0.6227028965950012, 'http://human.owl#NCI_C12798': 0.6220874190330505, 'http

In [26]:
def set_new_match(class1_uri, class2_uri, score, onto2_used_classes, class_results):
    """Set a new match for class1_uri and class2_uri."""
    onto2_used_classes[class2_uri] = class1_uri
    class_results[class1_uri] = ["", class2_uri, "", score]  # label is empty for now

def update_matching(new_class1_uri, class2_uri, new_score, old_class1_uri, onto2_used_classes, class_results, onto1_class_list):
    """Update matches when a better score is found."""
    onto2_used_classes[class2_uri] = new_class1_uri
    class_results[old_class1_uri] = ["", "", "", 0]  # Clear old match
    class_results[new_class1_uri] = ["", class2_uri, "", new_score]
    onto1_class_list.append(old_class1_uri)

def reevaluate(class1_uri, class2_uri, already_tested_classes, onto1_class_list):
    """Re-add class1_uri for re-evaluation."""
    onto1_class_list.append(class1_uri)
    already_tested_classes[class1_uri].add(class2_uri)

def perform_matching_llm(dict_similarity_scores_llm):
    # Initialize dictionaries and lists
    already_tested_classes = {}
    class_results = {}
    onto2_used_classes = {}
    onto1_class_list = list(dict_similarity_scores_llm.keys())

    while onto1_class_list:
        class1_uri = onto1_class_list.pop()
        already_tested_classes[class1_uri] = already_tested_classes.get(class1_uri, set())

        # Iterate over each class2_uri and score from the pre-sorted dictionary
        for class2_uri, score in dict_similarity_scores_llm[class1_uri].items():
            if class2_uri not in already_tested_classes[class1_uri]:
                already_tested_classes[class1_uri].add(class2_uri)  # Mark this class2_uri as tested

                # If class2_uri is still free, link it directly (this also covers perfect matches)
                if class2_uri not in onto2_used_classes:
                    set_new_match(class1_uri, class2_uri, score, onto2_used_classes, class_results)
                    break  # Successfully linked, no need to continue

                # If already linked, take it over only if the new score is strictly better,
                # so that a class of ontology 2 is never assigned to two classes of ontology 1
                elif score > class_results[onto2_used_classes[class2_uri]][3]:
                    old_class1_uri = onto2_used_classes[class2_uri]
                    update_matching(class1_uri, class2_uri, score, old_class1_uri, onto2_used_classes, class_results, onto1_class_list)
                    break  # Updated the link, no need to continue

    return class_results

# Call function
matched_results_llm = perform_matching_llm(dict_similarity_scores_llm)

In [96]:
# Print the first entries from the dictionary
for onto1_class in list(matched_results_llm.keys())[:20]:
    print(f"{onto1_class}: {matched_results_llm[onto1_class]}")

http://mouse.owl#MA_0002754: ['neocortex', 'http://human.owl#NCI_C33714', 'sylvian cistern', 0.3939228355884552]
http://mouse.owl#MA_0002530: ['pericardial fluid', 'http://human.owl#NCI_C38662', 'pericardial cavity', 0.6856664419174194]
http://mouse.owl#MA_0000527: ['abdomen/pelvis/perineum muscle', 'http://human.owl#NCI_C33300', 'perineal muscle', 0.7841752171516418]
http://mouse.owl#MA_0002069: ['ulnar artery', 'http://human.owl#NCI_C12839', 'ulnar artery', 0.9999997019767761]
http://mouse.owl#MA_0001962: ['gastroepiploic artery', 'http://human.owl#NCI_C52857', 'gastroepiploic artery', 1.0000001192092896]
http://mouse.owl#MA_0000060: ['blood vessel', 'http://human.owl#NCI_C12679', 'blood vessel', 1.0000001192092896]
http://mouse.owl#MA_0000518: ['abdomen blood vessel', 'http://human.owl#NCI_C34013', 'radial artery of the endometrium', 0.563804030418396]
http://mouse.owl#MA_0000736: ['cervical lymph node', 'http://human.owl#NCI_C32298', 'cervical lymph node', 0.9999998211860657]
http:

In [28]:
def transform_dict(original_dict):
    """
    Transforms a dictionary where URLs are the values into a dictionary where
    URLs are the keys and the values are concatenated labels associated with each URL.

    Args:
    original_dict (OrderedDict): The original dictionary with labels as keys and URLs as values.

    Returns:
    dict: A dictionary with URLs as keys and concatenated labels as values.
    """
    new_dict = {}
    for label, url in original_dict.items():
        if url in new_dict:
            new_dict[url] += " " + label  # Concatenating labels with space; change as needed
        else:
            new_dict[url] = label
    return new_dict

onto1_transformed_dict = transform_dict(onto1_dict)
onto2_transformed_dict = transform_dict(onto2_dict)

In [29]:
# Add labels
def add_labels(data, onto1_label, onto2_label):
    """Fill in the label slots of each match entry. Note: `data` is modified in place and returned."""
    # Iterate through each key and update the list with labels
    for key, values in data.items():
        # First empty string in the list gets replaced by the label from onto1_label if available
        label1 = onto1_label.get(key)
        if label1 is not None:
            values[0] = label1

        # Second empty string in the list gets replaced by the label from onto2_label using the second entry's key if available
        label2_key = values[1]  # The second entry in the list is assumed to be a key for the onto2_label
        label2 = onto2_label.get(label2_key)
        if label2 is not None:
            values[2] = label2

    return data

matched_results_llm_with_labels = add_labels(matched_results_llm, onto1_transformed_dict, onto2_transformed_dict)

In [97]:
counter = 0
# Print and count all matches whose LLM similarity score is below 0.6
for onto1_class, match in matched_results_llm_with_labels.items():
    if match[3] < 0.6:
        counter += 1
        print(f"{onto1_class}: {match}")
        
counter

http://mouse.owl#MA_0002754: ['neocortex', 'http://human.owl#NCI_C33714', 'sylvian cistern', 0.3939228355884552]
http://mouse.owl#MA_0000518: ['abdomen blood vessel', 'http://human.owl#NCI_C34013', 'radial artery of the endometrium', 0.563804030418396]
http://mouse.owl#MA_0000899: ['cortical layer iii', 'http://human.owl#NCI_C33143', 'multipolar neuron', 0.5624246001243591]
http://mouse.owl#MA_0001691: ['urethra smooth muscle', 'http://human.owl#NCI_C33637', 'striated muscle tissue cell', 0.4885055720806122]
http://mouse.owl#MA_0000112: ['nucleus pulposus', 'http://human.owl#NCI_C53038', 'spinalis thoracis', 0.4734574854373932]
http://mouse.owl#MA_0000621: ['hand digit blood vessel', 'http://human.owl#NCI_C52776', 'hand digit 2 phalanx', 0.5775056481361389]
http://mouse.owl#MA_0001072: ['nucleus of darkschewitsch', 'http://human.owl#NCI_C33171', 'node of bizzozero', 0.35910701751708984]
http://mouse.owl#MA_0000955: ['hippocampus molecular layer', 'http://human.owl#NCI_C13079', 'neural 

484

In [98]:
counter = 0
# Print the first entries from the dictionary
for onto1_class in list(string_matching_results.keys())[:20]:
    print(f"{onto1_class}: {string_matching_results[onto1_class]}")
    # if string_matching_results[onto1_class][0] == "neocortex":
    #     print(f"{onto1_class}: {string_matching_results[onto1_class]}")
    # if string_matching_results[onto1_class][3] == 0:
    #     counter += 1
        # print(f"{onto1_class}: {string_matching_results[onto1_class]}")
        
counter

http://mouse.owl#MA_0001230: ['extrinsic auricular muscle', 'http://human.owl#NCI_C32861', 'internal pterygoid muscle', 0.2]
http://mouse.owl#MA_0000052: ['upper leg', 'http://human.owl#NCI_C12671', 'upper extremity', 0.3333333333333333]
http://mouse.owl#MA_0000951: ['hippocampus ca2', 'http://human.owl#NCI_C32247', 'ca2 field of the cornu ammonis', 0.14285714285714285]
http://mouse.owl#MA_0000317: ['chondrocranium', '', '', 0]
http://mouse.owl#MA_0001439: ['thoracic vertebra 2', 'http://human.owl#NCI_C33502', 's5 vertebra', 0.25]
http://mouse.owl#MA_0000983: ['fourth ventricle choroid plexus', 'http://human.owl#NCI_C32308', 'choroid plexus of the fourth ventricle', 0.6666666666666666]
http://mouse.owl#MA_0001184: ['osseus spiral lamina', 'http://human.owl#NCI_C32917', 'lamina lucida', 0.25]
http://mouse.owl#MA_0002184: ['ocular angle vein', 'http://human.owl#NCI_C32393', 'costophrenic angle', 0.25]
http://mouse.owl#MA_0002095: ['superior cerebellar vein', 'http://human.owl#NCI_C33670'

0

In [80]:
# Count the classes that both methods match to the same onto2 class
counter = 0
for class_name, values in string_matching_results.items():
    class_2 = values[1]
    # Skip unmatched entries (empty onto2 URI)
    if class_2 and matched_results_llm_with_labels[class_name][1] == class_2:
        counter += 1
        
counter

1305

In [92]:
def calc_score_for_matched_classes(matched_classes, metric, dict_sim_scores_llm=None):
    """
    Re-score existing matches with another metric.

    If metric is "llm", the score is looked up in dict_sim_scores_llm; any other
    value is passed to execute_string_matching as the name of a string metric.
    """
    # Avoid a mutable default argument
    if dict_sim_scores_llm is None:
        dict_sim_scores_llm = {}
    matches_with_score = {}
    for class_name, values in matched_classes.items():
        label1, class_2, label2 = values[0], values[1], values[2]
        if label2:
            if metric == "llm":
                # Fall back to 0 if the pair has no precomputed LLM score
                matching_score = dict_sim_scores_llm.get(class_name, {}).get(class_2, 0)
            else:
                matching_score = execute_string_matching(metric, label1, label2)  # calculate string matching score
            matches_with_score[class_name] = [label1, class_2, label2, matching_score]
        else:
            matches_with_score[class_name] = [label1, class_2, "", 0]

    return matches_with_score

In [94]:
string_matches_for_llm = calc_score_for_matched_classes(matched_results_llm_with_labels, "Jaccard")
llm_matches_for_string = calc_score_for_matched_classes(string_matching_results, "llm", dict_similarity_scores_llm)

#### Final String matching lists

In [109]:
# Print the String matching result for the label "extrinsic auricular muscle"
for onto1_class in list(string_matching_results.keys()):
    if string_matching_results[onto1_class][0] == "extrinsic auricular muscle":
        print(f"{onto1_class}: {string_matching_results[onto1_class]}")

http://mouse.owl#MA_0001230: ['extrinsic auricular muscle', 'http://human.owl#NCI_C32861', 'internal pterygoid muscle', 0.2]


In [110]:
# Print the LLM match for "extrinsic auricular muscle", re-scored with String matching
for onto1_class in list(string_matches_for_llm.keys()):
    if string_matches_for_llm[onto1_class][0] == "extrinsic auricular muscle":
        print(f"{onto1_class}: {string_matches_for_llm[onto1_class]}")

http://mouse.owl#MA_0001230: ['extrinsic auricular muscle', 'http://human.owl#NCI_C33192', 'obturator internus muscle', 0.2]


#### Final LLM matching lists

In [112]:
# Print the LLM matching result for the label "extrinsic auricular muscle"
for onto1_class in list(matched_results_llm_with_labels.keys()):
    if matched_results_llm_with_labels[onto1_class][0] == "extrinsic auricular muscle":
        print(f"{onto1_class}: {matched_results_llm_with_labels[onto1_class]}")

http://mouse.owl#MA_0001230: ['extrinsic auricular muscle', 'http://human.owl#NCI_C33192', 'obturator internus muscle', 0.6077027320861816]


In [111]:
# Print the String match for "extrinsic auricular muscle", re-scored with the LLM
for onto1_class in list(llm_matches_for_string.keys()):
    if llm_matches_for_string[onto1_class][0] == "extrinsic auricular muscle":
        print(f"{onto1_class}: {llm_matches_for_string[onto1_class]}")

http://mouse.owl#MA_0001230: ['extrinsic auricular muscle', 'http://human.owl#NCI_C32861', 'internal pterygoid muscle', 0.5757712721824646]


**In total we now have 4 lists**:
- the best 1-to-1 matches found by String matching, with their String matching scores (`string_matching_results`)
- the best 1-to-1 matches found by LLM matching, with their LLM scores (`matched_results_llm_with_labels`)
- the 1-to-1 matches of the LLM matching, re-scored with String matching (`string_matches_for_llm`)
- the 1-to-1 matches of the String matching, re-scored with the LLM (`llm_matches_for_string`)

We **need all 4 lists because otherwise the results of the two methods cannot be merged**. In many cases the two methods (String and LLM) match the same class to different targets. <br><br>For example:
- String matching: http://mouse.owl#MA_0001230 matched to http://human.owl#NCI_C32861
- LLM matching: http://mouse.owl#MA_0001230 matched to http://human.owl#NCI_C33192

To merge them, every match has to carry a score from both methods, so the matches of each method are also scored with the other method, which results in 4 lists.
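
A minimal sketch of how the four lists could be merged is shown here. It is not the notebook's implementation (combining is still an open TODO). The function name `combine_candidates`, the weighted average, and the parameters `weight_string` and `threshold` are assumptions made for illustration. The sample entries reuse the scores printed above for http://mouse.owl#MA_0001230.

```python
def combine_candidates(string_matches, llm_for_string, llm_matches, string_for_llm,
                       weight_string=0.5, threshold=0.8):
    """Per onto1 class, keep the candidate with the higher weighted score.

    All four inputs map class1_uri -> [label1, class2_uri, label2, score].
    Hypothetical helper: the weighting scheme is an assumption, not the
    notebook's method.
    """
    merged = {}
    for class1_uri in set(string_matches) | set(llm_matches):
        candidates = []  # tuples of (class2_uri, string_score, llm_score)

        # Candidate proposed by String matching, with its LLM score
        if class1_uri in string_matches and class1_uri in llm_for_string:
            class2_uri, s_score = string_matches[class1_uri][1], string_matches[class1_uri][3]
            if class2_uri:
                candidates.append((class2_uri, s_score, llm_for_string[class1_uri][3]))

        # Candidate proposed by LLM matching, with its String matching score
        if class1_uri in llm_matches and class1_uri in string_for_llm:
            class2_uri, l_score = llm_matches[class1_uri][1], llm_matches[class1_uri][3]
            if class2_uri:
                candidates.append((class2_uri, string_for_llm[class1_uri][3], l_score))

        best = None
        for class2_uri, s_score, l_score in candidates:
            combined = weight_string * s_score + (1 - weight_string) * l_score
            if best is None or combined > best[1]:
                best = (class2_uri, combined)

        # Keep only matches at or above the user-defined threshold
        if best is not None and best[1] >= threshold:
            merged[class1_uri] = best
    return merged


uri = "http://mouse.owl#MA_0001230"
label = "extrinsic auricular muscle"
sample_string = {uri: [label, "http://human.owl#NCI_C32861", "internal pterygoid muscle", 0.2]}
sample_llm_for_string = {uri: [label, "http://human.owl#NCI_C32861", "internal pterygoid muscle", 0.5757712721824646]}
sample_llm = {uri: [label, "http://human.owl#NCI_C33192", "obturator internus muscle", 0.6077027320861816]}
sample_string_for_llm = {uri: [label, "http://human.owl#NCI_C33192", "obturator internus muscle", 0.2]}

merged = combine_candidates(sample_string, sample_llm_for_string,
                            sample_llm, sample_string_for_llm,
                            weight_string=0.5, threshold=0.0)
print(merged)
```

With equal weights, the LLM candidate wins for this class because both candidates have the same String matching score of 0.2. Note that picking the best candidate per class does not by itself guarantee a 1-to-1 result, so a conflict resolution step like the one in `perform_matching_llm` would still be needed afterwards.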

In [None]:
# Check http://mouse.owl#MA_0001424 (label: cervical vertebra 4)
# Currently matched with http://human.owl#NCI_C32245 (label: C7_Vertebra),
# but reference.rdf matches it with http://human.owl#NCI_C32242 (label: C4_Vertebra) with score 1.
# REPORT: when the matching is run again, http://mouse.owl#MA_0001424 gets matched to http://human.owl#NCI_C32431,
# which has a score of 0.16. Reason: the elements were iterated in a different order, so the previously matched
# class (http://human.owl#NCI_C32245) is now taken by another class with the same score of 0.25:
# http://mouse.owl#MA_0001436: ['sacral vertebra 3', 'http://human.owl#NCI_C32245', 0.25]
print(jaccard_similarity("cervical vertebra 4", "C4_Vertebra"))
print(jaccard_similarity("thoracic vertebra 10", "C4_Vertebra"))

0.0
0.0
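
The two zeros above are plausibly a tokenization effect: if the labels are compared as sets of whitespace-separated tokens, the raw label `C4_Vertebra` is a single token that shares nothing with `cervical vertebra 4`. The sketch below illustrates this under that assumption. `token_jaccard` is a hypothetical stand-in and not the notebook's `jaccard_similarity`, whose definition is not shown in this section.

```python
import re


def token_jaccard(label_a, label_b, normalize=False):
    """Jaccard similarity on token sets (illustrative stand-in, see lead-in).

    With normalize=True, labels are lowercased and split on whitespace and
    underscores before the comparison.
    """
    def tokens(label):
        if normalize:
            return {t for t in re.split(r"[\s_]+", label.lower()) if t}
        return set(label.split())

    a, b = tokens(label_a), tokens(label_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


raw_score = token_jaccard("cervical vertebra 4", "C4_Vertebra")
norm_score = token_jaccard("cervical vertebra 4", "C4_Vertebra", normalize=True)
other_score = token_jaccard("thoracic vertebra 10", "C4_Vertebra", normalize=True)
print(raw_score, norm_score, other_score)
```

After normalization both vertebra labels share only the token `vertebra` with the target, so they receive the same score. Ties like this are one way the result can depend on iteration order, as described in the comments of the cell above.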


In [65]:
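counter = 0
unique_values = set()
for key, element in string_matching_results.items():
    # Show which onto1 class currently holds http://human.owl#NCI_C32245
    if element[1] == "http://human.owl#NCI_C32245":
        print(key)
        print(element)
    # Collect all scores other than 0 and 1 (element[2] is the score in this 3-element result format)
    if element[2] not in (0, 1):
        counter += 1
        unique_values.add(element[2])
# unique_values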

http://mouse.owl#MA_0001444
['thoracic vertebra 7', 'http://human.owl#NCI_C32245', 0.25]


In [None]:
# TODO implement combining and filtering
# TODO implement user inputs (a weighted average with a user-defined formula would also be nice)

Output format example:
```
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
	 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
	 xmlns:xsd="http://www.w3.org/2001/XMLSchema#">

<Alignment>
<xml>yes</xml>
<level>0</level>
<type>??</type>

<map>
	<Cell>
		<entity1 rdf:resource="http://mouse.owl#MA_0002401"/>
		<entity2 rdf:resource="http://human.owl#NCI_C52561"/>
		<measure rdf:datatype="xsd:float">1.0</measure>
		<relation>=</relation>
	</Cell>
</map>
</Alignment>
</rdf:RDF>
```
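
A rough sketch of how matches above a user-defined threshold could be written in this format, using plain string building from the standard library. The function name `alignment_to_rdf` and the constants are assumptions for illustration. The header is copied from the example above, and the relation is written as `=` as in that example. The sample entries are taken from the outputs earlier in the notebook.

```python
from xml.sax.saxutils import quoteattr

ALIGNMENT_HEADER = """<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
	 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
	 xmlns:xsd="http://www.w3.org/2001/XMLSchema#">

<Alignment>
<xml>yes</xml>
<level>0</level>
<type>??</type>
"""

ALIGNMENT_FOOTER = "</Alignment>\n</rdf:RDF>\n"


def alignment_to_rdf(matches, threshold):
    """Serialize matches (class1_uri -> [label1, class2_uri, label2, score]).

    Hypothetical helper: entries without a target class or with a score
    below the threshold are skipped.
    """
    parts = [ALIGNMENT_HEADER]
    for class1_uri, (_, class2_uri, _, score) in matches.items():
        if not class2_uri or score < threshold:
            continue
        parts.append(
            "\n<map>\n"
            "\t<Cell>\n"
            # quoteattr escapes the URI and adds the surrounding quotes
            f"\t\t<entity1 rdf:resource={quoteattr(class1_uri)}/>\n"
            f"\t\t<entity2 rdf:resource={quoteattr(class2_uri)}/>\n"
            f'\t\t<measure rdf:datatype="xsd:float">{float(score)}</measure>\n'
            "\t\t<relation>=</relation>\n"
            "\t</Cell>\n"
            "</map>\n"
        )
    parts.append(ALIGNMENT_FOOTER)
    return "".join(parts)


sample_matches = {
    "http://mouse.owl#MA_0002069": ["ulnar artery", "http://human.owl#NCI_C12839", "ulnar artery", 0.9999997019767761],
    "http://mouse.owl#MA_0002754": ["neocortex", "http://human.owl#NCI_C33714", "sylvian cistern", 0.3939228355884552],
    "http://mouse.owl#MA_0000317": ["chondrocranium", "", "", 0],
}

xml_text = alignment_to_rdf(sample_matches, threshold=0.9)
print(xml_text)
```

With a threshold of 0.9, only the `ulnar artery` pair ends up in the output. The string could then be written to a file with `open(path, "w", encoding="utf-8")`.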